BIOST 561: unit testing
Lecture 5
Announcements
- Your HW2s look great – I’ll be grading them this weekend
- You probably got an email from UW Course Evaluations via IASystem Notification about this course’s midterm course feedback. It would mean a lot to me if you filled it out
Why unit testing?
- A unit test is additional code you write to verify
some quality about your function
- Then, there’s a way to automatically run all your unit tests
- This concept extends well beyond R, but we’ll be learning how it
works in R for this course
- You want some confidence that when you use your function in “real”
situations, your code is doing what you think it’s doing
- Your code evolves over time, so you want to make sure you don’t
introduce extra bugs
- Your unit tests are, in a way, documentation
- You are “documenting” the expected behavior of your
functions
- (My personal opinion): If you do not know how to test your
code, then you don’t really understand your own code
- Corollary: If you don’t know how to design a setting where you know
what should happen, then you don’t really understand your own code
Your Q4 in HW2
Suppose your teammate gave you a function to find the maximal
clique in an adjacency matrix (i.e., the set of nodes that
forms the largest clique). You are not told the typical size and
characteristics of these adjacency matrices beforehand. Your job is to
make sure this function is correct since you and your teammate are about
to give this function to your manager, who will then give it to another
division in your company to use. Your performance review will depend
highly on whether or not other people in your company can reliably use
your function.
In a short paragraph, write down ways to ensure your teammate’s
function is “correct.” Please list at least four different ways you can
test this function. You can interpret this notion of “correct” very
liberally – this question is purposely framed to be open-ended.
5-10 minute discussion with your tablemates
- What did you write for this question?
- What “foreseeable concerns/nightmares” are you trying to avoid by
writing unit tests?
Types of unit tests
- In the next few slides, I copy-pasted (more-or-less verbatim) what
you guys wrote
- All I did was organize them into categories
Checking that the function outputs something that is the
correct type
- “Try passing something that is not a matrix, like a data frame,
through the function and see if it returns an answer. If it does, it is
not correctly checking the data structure that needs to be used in the
function.”
- “For format tests, we may test if input and output always returns
correct format, and check if there are inconsistencies among different
tables/tibbles.”
- “Give the function a sample input and check its output.”
Simple checks to make sure result is within the correct
range
- “Test if function is giving out the expected value for the Maximal
Clique for a generated small graph.”
- “We could also add a check to ensure the inputs and output are the
data types we would expect.”
- “First I would make sure that the function returned the right answer
for normal-ish matrixes that I would expect it would be given.”
- “Second, I may give it a fully connected graph and check if it
returns all the nodes.”
Making sure the function runs on many different inputs (or
same input, if there’s randomness in the method)
- “Repeat test on the same adjacency matrix, to check if the output is
the same.”
- “We could generate graphs with a known maximal clique, using say
generate_random_graph with a low density and different
sizes of cliques.”
- “We might also want to check the behavior of the function if
different for the adjacency matrix are used.”
- “I’d also try random graphs where there aren’t any big cliques, just
to make sure it doesn’t find something that isn’t there.”
Making sure the function gets the correct answer for
carefully crafted problem
- “Again, use a random adjacency matrices with very low edge density
(no necessary to be the same as method 2), manually insert a planted
clique, as big as enough to ensure it is the largest. Then, run the
function to check if it can detect.”
- “We could also manually generate graphs and compute the maximal
clique.”
- “Then, to see if the result is correct, the hardcore way is to
compare the truth to function’s result.”
- “Come up with specific examples of graphs with known cliques and
make sure the function can return the set of nodes that form the maximal
cliques.”
- “First, I’d use really small graphs where I can figure out the
biggest clique by hand and see if the function matches.”
- “I will create graphs that have cliques of different sizes and make
sure it picks the biggest one.”
- “I think I will just create adjacency matrixs with known maximal
clique, and see if the function can return the correct answer.”
- “First test it on some tiny graphs where I already know what the
largest clique should be, like a fully connected group of 5 nodes.”
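The “planted clique” idea above can be sketched directly. Here, `find_max_clique` stands in for your teammate’s (hypothetical) function, and `is_clique()` is a small helper introduced only for illustration:

```r
# Sketch: build a sparse random graph, then plant a clique of known size.
# find_max_clique is the (hypothetical) function under test; this only
# shows how to construct the input and verify a candidate answer.
set.seed(561)
n <- 10
adj <- matrix(0, n, n)
# sprinkle a few random edges (low density, so the planted clique stays largest)
idx <- which(upper.tri(adj))
adj[sample(idx, 5)] <- 1
adj <- adj + t(adj)             # make symmetric
clique_nodes <- 1:4             # plant a 4-clique on nodes 1 through 4
adj[clique_nodes, clique_nodes] <- 1
diag(adj) <- 0

# helper: does a node set form a clique in adj?
is_clique <- function(adj, nodes){
  sub <- adj[nodes, nodes, drop = FALSE]
  all(sub[upper.tri(sub)] == 1)
}

stopifnot(is_clique(adj, clique_nodes))   # the planted set is indeed a clique
# In a testthat file, you would then check something like:
# result <- find_max_clique(adj)
# expect_true(is_clique(adj, result) && length(result) >= 4)
```

The point of the helper is that you can verify any returned answer is at least a *valid* clique of the expected size, even before worrying about maximality.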
Making sure the function gets the correct answer with many
randomly generated problems
- “Randomly add or remove a few edges from an adjacency matrix, to
check if the function can detect the same main clique as the original
one.”
- “Create random adjacency matrices with very high edge density, to
check if the function can find the correct large clique.”
Stress testing to make sure the function handles corner cases
gracefully
- “Check a test matrix that has no cliques and see if it returns an
answer. If it does, that means it is not correct in identifying
cliques.”
- “Test on graphs with no cliques larger than 2 as an edge case.”
- “Test on an empty graph.”
- “Then I would move into some edge cases. A matrix with all zeros. A
matrix that is 1x1. A matrix that has NaNs. A matrix that is insanely
huge.”
- “We should first test extreme cases, if this `find_max_clique` runs
on 1x1 matrix, and then goes to more extreme cases like 100k x 100k.
Also like all connected matrix and no connectinos at all.”
- “Come up with specific examples of graphs with NO cliques and make
sure the function returns an empty set.”
- “Pass in a fully disconnected graph and make sure the function
returns an empty set”
- “Use some corner cases (very small matrix and very large matrix with
many 1s or 0s) to test the function.”
- “I may also create some matrixs with no connection to each other and
see if the function correctly distinguish it”
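A few of these corner cases can be written down as concrete inputs. `find_max_clique` is again the hypothetical function under test, and the exact convention for “no clique” (empty set vs. a single node) depends on how the function is documented:

```r
# Sketch of corner-case inputs (find_max_clique is the hypothetical function
# under test, so its calls are left as comments):
empty_graph <- matrix(0, nrow = 5, ncol = 5)   # no edges at all
one_node    <- matrix(0, nrow = 1, ncol = 1)   # smallest possible input
full_graph  <- matrix(1, nrow = 5, ncol = 5)   # fully connected graph
diag(full_graph) <- 0

# Expected behaviors (as testthat assertions inside a test_that() block):
# expect_true(length(find_max_clique(empty_graph)) <= 1)
# expect_equal(find_max_clique(one_node), 1)
# expect_setequal(find_max_clique(full_graph), 1:5)
```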
Comparing against another known implementation
- “Cross compare with other/ previously established methods that is
known to work.”
- (This is 2024 student): “Use the function in conjunction with other
clique-finding algorithms or software, where possible, to cross-verify
results. Discrepancies can help identify specific situations where the
function may fail.”
Comparing against another one of your implementations
(possibly much more computationally intensive, but more transparent and
definitely correct)
- (This is 2024 student): “The third way is to write another algorithm
by myself, testing it by inputting several randomly generated adjacency
matrix to find out the maximal clique, and compare my results with the
results generated from my teammate’s function.”
Math: Exploiting some mathematical property of your
problem
- “Pass a test matrix through that has two large cliques, but one that
is just slightly larger than the other. If the function incorrectly
identifies the other clique as the maximal clique, it is
incorrect.”
- “I’ll also double-check that whatever it returns is actually a real
clique, meaning all the nodes are connected to each other.”
- “I would also shuffle the rows and columns of the adjacency matrix
to make sure the function still finds the same clique, because the order
shouldn’t matter”
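The shuffling idea can be sketched as follows; `find_max_clique` is hypothetical, so only the permutation machinery is executed here:

```r
# Sketch: relabeling the nodes should not change the size of the maximal
# clique. find_max_clique is hypothetical, so the invariance we would
# assert is shown as a comment.
set.seed(123)
n <- 6
adj <- matrix(rbinom(n * n, 1, 0.5), n, n)
adj[lower.tri(adj)] <- t(adj)[lower.tri(adj)]  # symmetrize
diag(adj) <- 0

perm <- sample(n)
adj_perm <- adj[perm, perm]     # same graph, relabeled nodes

# Sanity check: the relabeled graph has the same edges
stopifnot(sum(adj) == sum(adj_perm))

# In a test: expect_equal(length(find_max_clique(adj)),
#                         length(find_max_clique(adj_perm)))
```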
Testing the timing and memory of your function
- “Pass larger and larger matrices through the function to make sure
it can handle larger calculations without overflowing.”
- “If this ‘correct’ include efficiency, which it does not blows in
time for expected size of data, then it is more of looking at the
structure of code, method used.”
Testing to make sure it errors when expected
- “Then I would move to more error handling, type checking, etc, once
I knew the algorithm worked well. This would involve giving the function
strings, scalers, other function, and other weird stuff.”
- “Pass in incorrect arguments to the function (e.g wrong type of
data) and make sure the function either warns the user or throws an
error”
- “Use incorrect input to check the function’s output.”
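A minimal sketch of this style of test, assuming testthat is installed; `check_adjacency()` is a made-up validator for illustration, not a function from the course package:

```r
library(testthat)

# Hypothetical input validator: rejects anything that is not a square
# numeric matrix.
check_adjacency <- function(mat){
  if (!is.matrix(mat) || !is.numeric(mat)) stop("mat must be a numeric matrix")
  if (nrow(mat) != ncol(mat)) stop("mat must be square")
  invisible(TRUE)
}

test_that("check_adjacency errors on bad input", {
  expect_error(check_adjacency("not a matrix"))
  expect_error(check_adjacency(data.frame(a = 1)))
  expect_error(check_adjacency(matrix(1, nrow = 2, ncol = 3)))
  expect_true(check_adjacency(matrix(0, 2, 2)))
})
```

Note that `expect_error()` passes when the code *does* error – deliberately feeding your function “bad” inputs is the whole point of this category.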
Testing the coding environment
- “Check if this function runs correctly when called in other
projects.”
- (This is 2024 student): “I will also check if this function can run
well in different environments, which means that it doesn’t rely on
something only in my colleague’s computer.”
Checking the documentation/literal coding
- (This is 2024 student): “With the help of
help() in R
to make sure the description and the code is with the same logic.”
- (This is 2024 student): “You can make sure they follow the company’s
style guide and coding grammar.”
- (This is 2024 student): “I might consider the function’s readability
as a relevant consideration, because if the code is a gargled mess
people in the company will have difficulty using it.”
Checking the intermediary functions!
- “If this ‘correct’ include perspective of checking the algorithm, we
can test infinite loops, correct intermediate steps. If it include tree
structures I believe there are correctness test for those too.”
- (This is 2024 student): “A more tedious method is to run the
intermediate functions one at a time to verify if each step of our new
function adheres to the intended logic and generate the desired results
independently.”
Testing the behavior when there is deliberately no unique
answer
- (This is 2024 student): “Check the output when graphs contain
multiple equally sized maximal cliques.”
So… what does this show?
- You actually know quite a lot of ways to make sure
your code is correct!
- As you can see, there are (broadly speaking) two main categories of
tests
- Completion: Does my code run and terminate when it
receives different inputs?
- Correctness: Is my code giving the correct
output?
- Many other things you discussed aren’t “tests” that can be easily
automated. (More on this later!)
- There’s only one “type” of unit tests that I use that I did not see
people mention in their homeworks (unless I missed it when looking at
your HW2s)
What’s the goal?
- A unit test is a short script you write that:
- “has a function do something”
- “checks the output”
- This is to ensure that the function “works”

You ideally should be writing unit tests for each function you
write. The more you “refactor” your functions (into more manageable
smaller functions), the easier you’ll know what tests to write.
R Studio and R packages are convenient for easily running all
your unit tests, so you can easily assess how “stable” your codebase
is.
This can be painful!! Whenever you update your method, you might
need to update your unit tests. However, this should be a “rolling
experience” – as you find new bugs in your code, you should be writing
more unit tests.
A personal note
In my experience, most PhD students I talk to cannot be bothered
to write unit tests since it’s a pain to write/maintain all this
“additional code.”
At the same time, in my experience, it’s not a matter of “whether
or not my codebase crashes,” but rather a question of “what time
in the future will my codebase become so complicated that everything
starts to fail at the same time?”
After you’ve had your first traumatic experience of your
codebase failing, come back to these slides and learn more
about how to write unit-tests so you don’t need to relive the trauma in
the future.
Setting up unit tests in R
- There’s a lot of “rules” on how to set up unit tests in an R
package
- Let’s take a look at the demo package for this course: https://github.com/linnykos/561_s2025_example
(Screenshot from the 2024 version)
- You’ll notice a
tests folder:

- This is where all your tests will live – it’s just another
specifically named R folder
- It must be a folder called
tests

- If we look at the
testthat.R file, you’ll see that
it looks like this
- Every
testthat.R file you ever write will have
exactly these three lines.
- You would just replace
UW561S2025Example in both places
with your package name (for your homeworks, this would be
UWBiost561)
library(testthat)
library(UW561S2025Example)
test_check("UW561S2025Example")
- This is where your unit tests live
- Usually (but not required), I make one
.R file in the
testthat folder for every .R file in the
R folder (where your functions live)
- This helps keep things organized – you can easily find the function
definitions (in the
R folder) and their corresponding tests
(in the testthat folder)
- First: The start of a
.R file in the
testthat folder is the context
- What is the goal of these tests in the
.R file?
- Usually, I just write: “Testing [the functions in the corresponding
.R file in the R folder]”
context("Testing compute_probabilities")

# Unit test for compute_probabilities
test_that("compute_probabilities outputs correctly", {
  set.seed(10)

  # Mock data and parameters for testing
  data <- matrix(rnorm(20), nrow = 10, ncol = 2)          # 10 samples, 2-dimensional
  means <- matrix(c(0, 0, 5, 5), nrow = 2, byrow = TRUE)  # 2 components
  variances <- c(1, 2)
  proportions <- c(0.5, 0.5)

  probabilities <- compute_probabilities(data, means, variances, proportions)

  # Test if probabilities sum to 1 for each sample
  expect_true(all(abs(rowSums(probabilities) - 1) < 1e-6))

  # Test if probabilities are within the valid range [0,1]
  expect_true(all(probabilities >= 0 & probabilities <= 1))

  # Test for handling of a single sample (edge case)
  single_sample <- data[1, , drop = FALSE]  # Prevent dropping to lower dimension
  probabilities_single <- compute_probabilities(single_sample,
                                                means,
                                                variances,
                                                proportions)
  expect_true(dim(probabilities_single)[1] == 1)
  expect_true(all(abs(rowSums(probabilities_single) - 1) < 1e-6))
})
- Next: Each unit test is wrapped inside a
test_that()
function.
- The first argument is always a string (usually a short sentence
describing what this test does)
- The second “argument” is a code block. It is always
sandwiched within a
{} block.
- Inside the test, of course, you need to create the “inputs” for your
function
- Then, of course, for the test for
compute_probabilities, we need to actually use the
compute_probabilities function.
- The output in this example is
probabilities
- We will be checking if the output
probabilities is what
we expect
- You need to use the function
expect_true() to let R
know that you expect a particular statement to be true.
- Of course, the statement inside
expect_true() is itself code that will be evaluated
- If the output is a single
TRUE boolean, then the test will
pass (but I have not yet shown you how these tests are run!)
- Be aware: you must be in an R project where the
.Rproj file is in the top folder of your R package. That
is, your files should look something like this

- If you are not in an R project, when you run
devtools::check(), this happens. (Essentially, R doesn’t
know what R package you’re trying to test.)

- Check in-class demonstration on the
UW561S2025Example
package
- Going over
devtools::test() and
devtools::check() – what the results look like when a test
fails
- Going over what else
devtools::check() looks at
Three more things: refactoring your code
- Consider the analogy of an assembly line that builds an electric car
- It’s very hard to check the final car if you never
did any other checks
- How can you possibly look through every component under the car,
especially if they’re screwed in tightly or hard to access?
- Instead, it’ll be better to check each component
separately
- This is the idea of refactoring your code
- If you have a complicated function, you’ll want to split it into
smaller functions so that it’s more straightforward to test those
smaller functions
Alternative:
cleanup_na_matrix <- function(mat){
  if(!is.matrix(mat) || !is.numeric(mat))
    stop("mat is not a numeric matrix")
  p <- ncol(mat)
  mat <- sapply(seq_len(p), function(j){
    .cleanup_vector(mat[, j])
  })
  return(mat)
}
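The helper `.cleanup_vector()` is not shown on the slide. Purely as an illustration of the refactoring idea, here is one plausible sketch (assumption: it mean-imputes the NAs in a single column):

```r
# Hypothetical helper, assumed by cleanup_na_matrix above. One plausible
# behavior: replace each NA with the mean of the remaining entries.
.cleanup_vector <- function(vec){
  stopifnot(is.numeric(vec))
  vec[is.na(vec)] <- mean(vec, na.rm = TRUE)
  vec
}
```

With a helper like this, you can unit-test the one-column behavior on its own, which is exactly why the refactored version is easier to test than one monolithic function.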
Third: the limitations of unit testing
- It is hard to test that a plotting function is “correct”. (I’ve
often found it’s not worth the effort to code tests for this.)
- As your functions (especially statistical estimators) get more
complicated, it’s often harder to come up with meaningful tests of
“correctness”
- We’ll do an exercise in just a bit to showcase this
- Even if you wrote many tests, it does not mean:
- Your code is readable or is well documented – This was what we
discussed in Lecture 4
- Your code works in different environments
- Your code is efficiently coded (with respect to how much time it
takes to run) – This will be the focus of Lecture 6
- Your function is easy to use or has reasonable default values
- However, testing does accomplish a lot of things
nonetheless!
Words of wisdom
- Always test your functions as you
are developing your code.
- It is extremely painful to test your functions when
you’re done with your project
- Also, you’ll be very terrified if you finish your
project, and then realize some of your functions have a
critical failure
- The more you “refactor” your functions (into more manageable smaller
functions), the easier you’ll know what tests to write.
- It’s easier to write many tests for functions that each do specific
things, rather than a 200+ line monstrosity
- As you fix bugs in your code, you should be writing more unit tests
to make sure that bug does not reappear in the future
In-class exercise: Bootstrap confidence intervals
- (5-10 minutes)
- In groups of 2 or 3, discuss what types of unit tests would verify
that you constructed valid bootstrap confidence intervals for linear
regression
bootstrap_lm_coef <- function(X, y, alpha = 0.05, nboot = 1000){
  # something is done here
}
- Do not do this alone! I want to hear
discussions
- Use the TinyURL (to a Google Docs) to write down some of the ways
- (Link to be created live in class)
- Observe: The bootstrap is a random procedure, so it’s hard to test
if the code is “correct” because it’s not easy to construct a simple
setting where you know what the “correct” answer is
- However, you do know that the intervals should get
wider as
alpha gets smaller
- Hence, you can compare extreme cases: The intervals for
alpha=0.5 (i.e., 50% confidence interval) should be much
narrower than the intervals for alpha=0.01 (i.e., 99%
confidence interval) for the vast majority of instances
- That is, you can show your method is “correct” by comparing the
relation between different outputs
- Most complex statistical methods have some type of “monotonic”
behavior (although it might be probabilistic in nature)
- If you design your own classifier, you can create
inputs that you know are “easy” or that are “hard”, and you can see that
the classifier has a higher accuracy on the “easy” settings than the
“hard” settings
- In the maximal clique question in your homework, you can test that
if you add edges to a graph, the maximal clique should never get
smaller
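A sketch of such a monotonicity test. Percentile intervals of a plain random sample are used so the code runs on its own; the commented lines show the analogous assertion for the exercise’s `bootstrap_lm_coef`, whose return format (a matrix of lower/upper bounds) is my assumption:

```r
# Sketch of a monotonicity test: a (1 - alpha) interval should widen as
# alpha shrinks. Demonstrated with simple percentile intervals of a random
# sample; for the (hypothetical) bootstrap_lm_coef output the assertion
# would be analogous.
set.seed(561)
draws <- rnorm(1e4)

interval_width <- function(x, alpha){
  q <- quantile(x, c(alpha / 2, 1 - alpha / 2))
  unname(diff(q))
}

width_50 <- interval_width(draws, alpha = 0.5)    # 50% interval
width_99 <- interval_width(draws, alpha = 0.01)   # 99% interval
stopifnot(width_50 < width_99)                    # wider as alpha shrinks

# For the exercise's function (hypothetical, assumed to return a matrix
# with lower/upper columns):
# ci_wide   <- bootstrap_lm_coef(X, y, alpha = 0.01)
# ci_narrow <- bootstrap_lm_coef(X, y, alpha = 0.5)
# expect_true(all(ci_wide[, 2] - ci_wide[, 1] >=
#                 ci_narrow[, 2] - ci_narrow[, 1]))
```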

List of types of unit-tests
Testing for completeness:
- #1: Checking that the function outputs something that is the correct
type
- #2: Simple checks to make sure outputs are within the correct
range
- #3: Making sure the function runs on many different inputs
- I often randomly generate inputs for this
- It’s useful to make sure your outputs can be
different. If your function suspiciously always returns the same output
regardless of what your input is, your function probably isn’t
correct
- #4: Testing to make sure it errors when expected
- Usually, my tests for this are purposefully putting in “bad”
inputs
Testing for correctness:
- #5: Making sure the function gets the correct answer for carefully
crafted problem
- Often, this blends into #6, since I do this for only simple
settings
- #6: Testing to make sure the function handles corner cases
gracefully
- #7: Testing the behavior when there is deliberately no unique answer
- This is very important when it’s applicable.
- It often requires you to know something mathematical about your
function, where you carefully cook up a setting where there’s not a
unique correct answer
- #8: Comparing against another known implementation
- #9: Comparing against another one of your implementations (possibly
much more computationally intensive, but more transparent and definitely
correct)
- I use this a lot. Often, I have a “fast but very
opaque” implementation of a function that I’m going to use at scale for
my real project. My test would contain an “obviously correct when you
read the code, but very slow” implementation. Then, I just make
sure both implementations yield the same output on a wide diversity of
inputs.
- This is very common for functions that do crazy matrix math –
sometimes it’s easy to code the “wrong” matrix equation, but it’s also
easy to code a cumbersome
for loop that (mathematically)
does the same thing but is very slow
- #10: (Math) Exploiting some mathematical property of your problem
- This is as close as you can get to a gold standard,
but it’s often not possible
- This is usually: 1) exploiting some invariance in your function, or
2) checking the optimality conditions of an optimization problem
- #11: (Math) Comparing the relations between two different outputs of
your function, given two different inputs
- This is what most of my “more sophisticated” tests do
Elements #8 through #11 could all benefit from randomly generated
inputs since you don’t really need to know what the exact output is to
write a useful test!
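For example, the “slow but obviously correct” oracle from #9 for the maximal clique problem might look like this (`slow_max_clique` is my own sketch, exponential-time by design):

```r
# Sketch of a "slow but obviously correct" oracle: enumerate all node
# subsets, largest first, and return the first one that forms a clique.
# Exponential time, so only usable on tiny graphs -- which is exactly
# what tests need.
slow_max_clique <- function(adj){
  n <- nrow(adj)
  for (k in n:1) {
    combos <- combn(n, k)
    for (j in seq_len(ncol(combos))) {
      nodes <- combos[, j]
      sub <- adj[nodes, nodes, drop = FALSE]
      if (all(sub[upper.tri(sub)] == 1)) return(nodes)
    }
  }
  integer(0)
}

# Triangle on nodes 1-3 plus an extra edge 3-4:
adj <- matrix(0, 4, 4)
adj[1, 2] <- adj[1, 3] <- adj[2, 3] <- adj[3, 4] <- 1
adj <- adj + t(adj)
slow_max_clique(adj)  # returns 1 2 3
# In a test: expect_setequal(find_max_clique(adj), slow_max_clique(adj))
```

You would then compare this oracle against the fast (hypothetical) `find_max_clique` on many small, randomly generated graphs.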
- Remember, most sophisticated functions you will write will
not actually have a “correct” answer that is easy to
derive
- Finding a maximal partial clique (Eeks! Your HW3)
- Training a deep learning neural network
- Hypothesis testing and confidence intervals
- Methods that involve some type of randomness (typically as an
initialization)
- However, you have to get creative on how to test that your
implementation is reasonable (since it’s impossible to
guarantee it’s “correct”)
- If you do not know how to test your code, then you don’t
really understand your own code
Learn from other coders!
- I personally learn a lot from reading how the authors
of the R packages I use write their unit tests
With the remaining time…
- Go over the rest of HW3
- I’m going to live-demo how to code with ChatGPT
- The maximal partial clique is a bit of a passion project of mine. It
was the foundation of my first PhD paper, and I haven’t had a great
excuse to revisit this idea until now, many years later
- From Lin, Kevin Z., Han Liu, and Kathryn Roeder. “Covariance-based
sample selection for heterogeneous data: Applications to gene expression
and autism risk gene detection.” Journal of the American Statistical
Association 116.533 (2021): 54-67. https://github.com/linnykos/covarianceSelection
- If you’re interested in chatting about the possible statistical uses
of the maximal partial clique, come chat with me over the summer!
